Introduction

Welcome! The purpose of this project is to produce an algorithm (model) that can help the online real-estate website, Zillow.com, predict home sale prices with greater accuracy. Improving real-estate valuations is important for several reasons. Namely, it:

  1. Improves user experience
  2. Provides context for anticipated property taxes and homeowner’s insurance
  3. Alleviates historic and systemic biases commonplace with home valuations in neighborhoods of color

Data

To gather data, our team focused on Mecklenburg County, NC (Charlotte Metro Area) and sourced information from the county’s open data website, as well as the American Community Survey and the U.S. Census.

Charlotte Home Sales Data

To begin, we will import a home sales dataset that includes variables like location, housing characteristics, and home quality for the Charlotte Metro Area. We will then ‘clean’ the data by creating useful columns such as “building grade” as a numeric value (quality, where higher values correspond to greater quality), “age of home” (age), and “price per square foot” (pricesqft), and by calculating the number of “total baths” (totbaths) from the full- and half-bathroom counts. Moving forward, we’ll refer to this home sales data as “internal variables.”

#Import sales data, remove columns we won't need, add useful columns, set CRS for North Carolina
library(dplyr)
library(sf)
CLT_internal <- 
  st_read("https://github.com/mafichman/MUSA_508_Lab/raw/main/Midterm/data/2022/studentData.geojson") %>%
  st_transform('ESRI:103500')

CLT_internal <- CLT_internal[c(5,9,20,21,26,28,30:46,57:60,67,68,70,71,72)] %>%
  mutate(
    age = 2022 - yearbuilt,
    sqft = totalac * 43560,                  # acres to square feet
    pricesqft = price / (totalac * 43560),   # price per square foot
    totbaths = fullbaths + (halfbaths * 0.5))

CLT_internal$quality <- recode(CLT_internal$bldggrade, MINIMUM = 1, FAIR = 2, AVERAGE = 3, GOOD = 4, VERYGOOD = 5, EXCELLENT = 6, CUSTOM = 7)

Adding Amenities Data

To build a strong algorithm (model), it’s important to include variables that relate to the housing market such as local schools, grocery stores, and parks. We’ll refer to these variables as “amenities.”

# Adding school data 
CLT_schools <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Schools.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding grocery store data
CLT_grocery <- 
  st_read("Grocery_pts.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding parks data 
CLT_parks <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Parks.geojson") %>%
  st_transform(st_crs(CLT_internal))

Adding Spatial Structure Data

Finally, we will add variables that provide demographic and environmental data for the Charlotte Metro Area. Specifically, we will include educational attainment, household income, and neighborhoods data. We’ll refer to these variables as “spatial structure.”

# Adding demographic data
CLT_demo <- 
  get_acs(geography = "tract", 
          variables = c("B19013_001E", "B15003_022E","B15003_001E"), 
          year=2020, state=37, county=119, 
          geometry=TRUE, output="wide") %>%
  st_transform(st_crs(CLT_internal)) %>%
  dplyr::select( -NAME, -B19013_001M, -B15003_022M, -B15003_001M)

CLT_demo <-
  CLT_demo %>%
  rename(HH_inc = B19013_001E, 
         College = B15003_022E,
         College_age_pop = B15003_001E) %>%
  mutate(college_perc = College/College_age_pop) %>%
  dplyr::select( -College, -College_age_pop)

# Adding neighborhood data 
CLT_neighborhoods <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/School_districts.geojson") %>%
  st_transform(st_crs(CLT_internal))

Orienting Our Variables

So far, we have added internal, amenities, and spatial structure variables. However, in order to build our model and analyze how these variables relate to home sales, we must modify them. We’ll achieve this using 2 techniques:

1. K-nearest neighbor (KNN): this will find the distance between a given home and its nearest amenities (school, grocery store, park).

2. Spatial join (SJ): this will join our spatial structure data (educational attainment, neighborhoods) to our internal variables (Charlotte home sales).

*
Note to instructor: the nn_function did not work as expected, perhaps because the geometry contains multipoint features (rather than single X/Y points), so we took the centroid of each feature.
# Nearest school, grocery store, and park
CLT_internal <-
  CLT_internal %>% 
  mutate(
    school_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_schools)), k = 1),
    grocery_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_grocery)), k = 1),
    park_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_parks)), k = 1))
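The nn_function called above is a helper from the course function library and is not defined in this document; below is a minimal base-R sketch of what it computes. Only the name and signature are taken from the calls above — treat the body as an assumption. For each point (row of coordinates) in measureFrom, it returns the mean Euclidean distance to its k nearest points in measureTo.

```r
# Sketch of the nn_function helper (assumed; not the course's exact code).
# measureFrom, measureTo: two-column matrices of X/Y coordinates.
nn_function <- function(measureFrom, measureTo, k) {
  apply(measureFrom, 1, function(pt) {
    d <- sqrt((measureTo[, 1] - pt[1])^2 + (measureTo[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])  # average distance to the k nearest points
  })
}
```

With k = 1 this reduces to the distance to the single nearest amenity, which is how it is used above.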

# Spatial join 
CLT_internal <- 
  st_join(CLT_internal,CLT_demo) 

CLT_internal <- 
  st_join(CLT_internal, CLT_neighborhoods)

Summary Statistics

Below are summary statistics tables for each variable category (internal, amenities, spatial structure).

#Internal variables 
ss_internal <- CLT_internal
ss_internal <- st_drop_geometry(ss_internal)
ss_internal <- ss_internal %>%
  dplyr::select("sqft","pricesqft", "totbaths", "yearbuilt") 
stargazer(as.data.frame(ss_internal), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## =======================================================
## Statistic   N      Mean    St. Dev.   Min      Max     
## -------------------------------------------------------
## sqft      46,183 77,176.9 4,648,606.0 0.0 425,034,086.0
## pricesqft 46,138  Inf.0               0.0     Inf.0    
## totbaths  46,183   2.6        0.9     0.0     11.0     
## yearbuilt 46,183 1,993.4     31.8      0      2,022    
## -------------------------------------------------------
#Amenities 
ss_amenities <- CLT_internal
ss_amenities <- st_drop_geometry(ss_amenities)
ss_amenities <- ss_amenities %>%
  dplyr::select("school_nn1", "grocery_nn1", "park_nn1") 
stargazer(as.data.frame(ss_amenities), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables
## ================================================
## Statistic     N     Mean   St. Dev. Min    Max  
## ------------------------------------------------
## school_nn1  46,183 3,266.8 1,390.4  8.2  8,369.4
## grocery_nn1 46,183 1,737.8 1,274.4  37.0 8,599.7
## park_nn1    46,183 1,239.8  767.7   21.0 5,026.3
## ------------------------------------------------
# Spatial Structure
ss_spatial <- CLT_internal
ss_spatial <- st_drop_geometry(ss_spatial)
ss_spatial <- ss_spatial %>%
  dplyr::select("HH_inc", "college_perc", "FID") 
stargazer(as.data.frame(ss_spatial), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables
## ====================================================
## Statistic      N      Mean   St. Dev.  Min     Max  
## ----------------------------------------------------
## HH_inc       46,180 86,175.7 37,415.7 17,685 238,750
## college_perc 46,180   0.3      0.1     0.03    0.6  
## FID          46,183   26.0     10.6     0      43   
## ----------------------------------------------------

Correlation Matrix

Below is a table visualizing correlation between our variables. We can see that home price maintains a positive correlation with the following variables (in order of strength):

  • heated square footage of the home
  • home quality
  • total bathrooms
  • household income
  • percentage of residents with college degrees
  • bedrooms
  • number of fireplaces in the home

We can use this matrix to inform our variable selection for our model.

numericVars <- select_if(st_drop_geometry(CLT_internal), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("deepskyblue", "grey100", "firebrick1"),
  type = "lower",
  insig = "blank",
  tl.cex = 0.5,      # axis-label size/color are ggcorrplot arguments,
  tl.col = "black") +  # not labs() arguments
  labs(title = "Correlation Matrix of Numeric Variables") +
  theme(axis.text.x = element_text(size = 10)) +
  plotTheme()

Scatterplots

Below are 4 home price correlation scatterplots based upon the results of our correlation matrix:

st_drop_geometry(CLT_internal) %>% 
  dplyr::select(price, quality, heatedarea, HH_inc, yearbuilt) %>% 
    filter(price < 10000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "hotpink") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Internal and Spatial Variables") +
  plotTheme()

Maps

Below are 4 maps including:

1. A map of our dependent variable (price)

2. A map of park locations

3. A map of nearby grocery stores

4. A map of school quality

*
Note: the first 3 maps are in relation to home prices within the Charlotte Metro Area.

Map 1: Home Price

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Price", subtitle="Charlotte Metro Area") +
  labs(color = "Observed Sales Price (quintiles)") +
  mapTheme()
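The q5 and qBr helpers used in these maps also come from the course function library; below are minimal sketches of the assumed behavior (quintile classes for coloring, and rounded quintile break values for legend labels). The bodies are illustrative, not the course's exact code.

```r
library(dplyr)

# q5: classify a numeric variable into 5 quantile (quintile) groups
q5 <- function(variable) as.factor(ntile(variable, 5))

# qBr: rounded quintile break labels for a named column of a data frame
# (column name passed as a string, as in the map code above)
qBr <- function(df, variable) {
  x <- df[[variable]]
  as.character(round(quantile(x, c(0.01, 0.2, 0.4, 0.6, 0.8), na.rm = TRUE)))
}
```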

Map 2: Parks vs Home Price

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey70") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_parks, color="darkgreen") +
  labs(title="Park Locations vs Home Price", subtitle="Charlotte Metro Area") +
  mapTheme()

Map 3: Grocery Stores vs Home Price

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey30") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_grocery, color="deepskyblue") +
  labs(title="Grocery Store Locations vs Home Price", subtitle="Charlotte Metro Area") +
  mapTheme()

Map 4: School Quality

ggplot() +
  geom_sf(data = CLT_schools, aes(fill=factor(Quality), color=factor(Quality))) +
  scale_fill_brewer(palette="RdYlGn") +
  scale_color_brewer(palette="RdYlGn") +
  labs(title="School Quality", subtitle="Niche.com ratings; Charlotte Metro Area") +
  mapTheme()

Regression Model

Dividing the Data

Now that we have identified important variables, we will build our model by dividing the data into training and test sets, partitioned as a 60/40 train-test split. We’ll then run a regression on the training set (60%) and use the results to determine how well the model generalizes to the test set (40%).

However, before dividing the dataset we will create categorical variables for the number of floors, bathrooms, and bedrooms for a given home.

#Re-engineering data as categorical: number of floors
CLT_internal<- 
  CLT_internal%>%
  mutate(NUM_FLOORS.cat = ifelse(storyheigh == "1 STORY" | storyheigh == "1.5 STORY" | storyheigh == "SPLIT LEVEL" | storyheigh == "2.0 STORY", "Up to 2 Floors",
               ifelse(storyheigh == "2.5 STORY" | storyheigh == "3.0 STORY", "Up to 3 Floors", "4+ Floors")))
#Re-engineer bedroom as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BEDS.cat = ifelse(bedrooms <= 2, "Up to 2 Bedrooms",
                               ifelse(bedrooms == 3 | bedrooms == 4, "Up to 4 Bedrooms", "5+ Bedrooms")))
#Re-engineer bathroom data as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BATHS.cat = ifelse(totbaths <= 2.5, "Up to 2.5 Bathrooms",
                               ifelse(totbaths <= 4.5, "Up to 4 Bathrooms", "5+ Bathrooms")))
#Creating training data
inTrain <- createDataPartition(
              y = paste(CLT_internal$NUM_FLOORS.cat, CLT_internal$NUM_BEDS.cat), 
              p = .60, list = FALSE)
charlotte.training <- CLT_internal[inTrain,] 
charlotte.test <- CLT_internal[-inTrain,]  

reg.training <- lm(price ~ ., data = st_drop_geometry(charlotte.training) %>% 
                                    dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc))
summary(reg.training)
## 
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1700590   -84337    -8355    62079 43777952 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -294136.1494   39669.6022  -7.415
## heatedarea                           136.5569       4.4684  30.561
## quality                           197602.6787    5045.2573  39.166
## NUM_FLOORS.catUp to 2 Floors       58969.3580   21485.5243   2.745
## NUM_FLOORS.catUp to 3 Floors       35852.5896   27889.6130   1.286
## NUM_BEDS.catUp to 2 Bedrooms       54004.0648   15516.9983   3.480
## NUM_BEDS.catUp to 4 Bedrooms       39899.3091   10175.7475   3.921
## NUM_BATHS.catUp to 2.5 Bathrooms -431122.6701   24530.2029 -17.575
## NUM_BATHS.catUp to 4 Bathrooms   -439686.2605   22236.6845 -19.773
## park_nn1                              -4.2150       3.3004  -1.277
## grocery_nn1                          -21.9551       2.2168  -9.904
## age                                 2294.5786     121.2829  18.919
## HH_inc                                 0.7781       0.1164   6.687
## college_perc                       24732.5021   32799.3916   0.754
##                                              Pr(>|t|)    
## (Intercept)                         0.000000000000126 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                 0.006063 ** 
## NUM_FLOORS.catUp to 3 Floors                 0.198623    
## NUM_BEDS.catUp to 2 Bedrooms                 0.000502 ***
## NUM_BEDS.catUp to 4 Bedrooms        0.000088405034962 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                     0.201564    
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                              0.000000000023232 ***
## college_perc                                 0.450824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 395800 on 25554 degrees of freedom
##   (2146 observations deleted due to missingness)
## Multiple R-squared:  0.3201, Adjusted R-squared:  0.3197 
## F-statistic: 925.4 on 13 and 25554 DF,  p-value: < 0.00000000000000022

Evaluating Generalizability

To test the strength (accuracy) of our model in its ability to predict prices, we will:

  1. Find the mean absolute error (MAE) and mean absolute percentage error (MAPE)
  2. Conduct cross-validation tests
  3. Plot predicted prices as a function of observed prices
  4. Map our test set residuals, including a Moran’s I test and a plot of the spatial lag in errors
  5. Map the mean absolute percentage error (MAPE) by neighborhood
  6. Create a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood

MAE & MAPE
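Before running the full calculation, the two metrics can be illustrated on made-up numbers (in this sketch MAPE is scaled by the observed price; all prices below are hypothetical):

```r
# Toy illustration of MAE and MAPE on hypothetical observed/predicted prices
obs  <- c(200000, 350000, 500000)
pred <- c(220000, 330000, 450000)

MAE  <- mean(abs(pred - obs))        # average error in dollars
MAPE <- mean(abs(pred - obs) / obs)  # average error as a share of price

MAE             # 30000
round(MAPE, 3)  # 0.086
```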

#Creating predictions and calculating Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE)
charlotte.test <-
  charlotte.test %>%
  mutate(price.Predict = predict(reg.training, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)%>%
  filter(price < 5000000)

MAE <- mean(charlotte.test$price.AbsError, na.rm = T)
MAPE <- mean(charlotte.test$price.APE, na.rm = T)

reg.MAE.MAPE <- 
  cbind(MAE, MAPE) %>%
  kable(caption = "Regression MAE & MAPE") %>%
  kable_styling("hover",full_width = F) 

reg.MAE.MAPE
Regression MAE & MAPE
MAE MAPE
109529.1 0.2704347

Cross Validation

#Cross-validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(CLT_internal) %>% 
                                dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, grocery_nn1,
                                               age, HH_inc, college_perc, 
                                               park_nn1), 
     method = "lm", trControl = fitControl, na.action = na.pass)
# Table
stargazer(as.data.frame(reg.cv$resample), type="text", digits=1, title="Cross Validation Results")
## 
## Cross Validation Results
## =======================================================
## Statistic  N    Mean    St. Dev.     Min        Max    
## -------------------------------------------------------
## RMSE      100 268,328.6 296,854.0 133,413.4 2,137,604.0
## Rsquared  100    0.6       0.2      0.01        0.8    
## MAE       100 114,163.2 17,868.1  93,420.4   207,393.9 
## -------------------------------------------------------
# Plot 
ggplot(reg.cv$resample, aes(x=MAE)) +
  geom_histogram(fill = "darkgreen") +
  labs(title = "Count of Mean Absolute Error During Cross-Validation") +
  xlab("MAE")+
  ylab("Count")+
  plotTheme()

Visualizing Prediction Errors

#Visualizing prediction errors
charlotte.APE <- charlotte.test[c(6,36,43:46,51,54)]

charlotte_APE.sf <- 
  charlotte.APE %>%
  filter(price.APE > 0) %>%
  st_as_sf(sf_column_name=geometry) %>%
  st_transform('ESRI:103500')

grid.arrange(ncol=2,
             ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = q5(price.Predict)), size = .25) +
  scale_colour_manual(values = palette5,
                   labels=qBr(charlotte.APE,"price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Predicted Sales Price") +
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = price.APE), size = .25) +
  labs(title="Predicted Sales Price\nAbsolute Percent Error") +
  binned_scale(aesthetics = "color",
               scale_name = "stepsn",
               palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
               breaks = c(0.10, 0.20, 0.5, 0.75),
               limits = c(0, 50),
               show.limits = TRUE,
               guide = "colorsteps"
  ) +
  mapTheme())

#Predicted vs observed sales price
ggplot(
charlotte_APE.sf, aes(price, price.Predict, col = price.APE)) +
binned_scale(aesthetics = "color",
    scale_name = "stepsn", 
    palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
    breaks = c(0.10, 0.20, 0.5, 0.75),
    limits = c(0, 50),
    show.limits = TRUE, 
    guide = "colorsteps") +
geom_point(size=1) +
scale_y_continuous(limits = c(0, 4000000)) +
scale_x_continuous(limits = c(0, 4000000)) +
labs(title="Sales Price vs. Predicted", subtitle="Charlotte Metro Area") +
ylab("Predicted Sales Price (in dollars)") +
xlab("Observed Sales Price (in dollars)") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.5) +
labs(color = "Absolute % Error") +
geom_label(
  label="0% error line", 
  x=3500000,
  y=3000000,
  label.padding = unit(0.25, "lines"), # Rectangle size around label
  label.size = 0.15,
  color = "black",
  fill="grey80")

Evaluating Spatial Bias

By calculating Moran’s I for this dataset, we are testing for spatial autocorrelation among the home sale prediction errors in Charlotte. As seen in the outcome, the observed Moran’s I is high relative to the permuted distribution, meaning there is a spatial relationship in the prediction errors that must be accounted for.
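As a quick illustration of the statistic itself (not the moran.mc permutation test used below), Moran’s I for a variable x and spatial weight matrix W can be computed directly; the function name and toy data here are hypothetical:

```r
# Moran's I: I = (n / sum(W)) * (z' W z) / (z' z), with z = x - mean(x)
morans_i <- function(x, W) {
  z <- x - mean(x)
  as.numeric((length(x) / sum(W)) * (t(z) %*% W %*% z) / sum(z^2))
}

# Toy example: four locations in a line with values trending upward,
# and a row-standardized adjacency matrix (as in style = "W" below)
x <- c(1, 2, 3, 4)
W <- rbind(c(0,   1,   0,   0),
           c(0.5, 0,   0.5, 0),
           c(0,   0.5, 0,   0.5),
           c(0,   0,   1,   0))
morans_i(x, W)  # 0.4: positive spatial autocorrelation
```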

#Calculating Moran's I
coords.test <-  st_coordinates(charlotte.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
charlotte.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE))
## Simple feature collection with 18455 features and 54 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 422146 ymin: 141808.1 xmax: 466272.4 ymax: 196797.3
## Projected CRS: NAD_1983_CORS96_StatePlane_North_Carolina_FIPS_3200
## First 10 features:
##         pid totalac codemunici   municipali zipcode   price yearbuilt
## 1  00101333  0.5000       <NA> HUNTERSVILLE   28078 1315000      2005
## 2  00101334  0.5000       <NA> HUNTERSVILLE   28078  925000      1997
## 3  00101337  0.4155       <NA> HUNTERSVILLE   45601 2523000      2018
## 4  00101341  0.0000          6 HUNTERSVILLE   28078  250000      1980
## 5  00101433  0.0000          6 HUNTERSVILLE   28078  675000      2003
## 6  00101461  0.0000          6 HUNTERSVILLE   28078  422000      1992
## 7  00102218  2.4400          6 HUNTERSVILLE   28078  272000      1910
## 8  00102221  0.0000          6 HUNTERSVILLE   28078  587500      1988
## 9  00102318  1.2820          6 HUNTERSVILLE   28081  819500      1976
## 10 00102320  0.5950          6 HUNTERSVILLE   28202  738000      1965
##    heatedarea cdebuildin descbuildi storyheigh aheatingty heatedfuel     actype
## 1        6187         01        RES  2.0 STORY AIR-DUCTED        GAS AC-CENTRAL
## 2        2736         01        RES  1.5 STORY AIR-DUCTED        GAS AC-CENTRAL
## 3        5180         01        RES  1.5 STORY AIR-DUCTED        GAS AC-CENTRAL
## 4        1242         01        RES    1 STORY AIR-DUCTED   ELECTRIC AC-CENTRAL
## 5        2716         01        RES    1 STORY AIR-DUCTED        GAS AC-CENTRAL
## 6        2522         01        RES  1.5 STORY AIR-DUCTED        GAS AC-CENTRAL
## 7        1312         01        RES    1 STORY AIR-DUCTED        GAS    AC-NONE
## 8        2870         01        RES  2.0 STORY  HEAT PUMP   ELECTRIC AC-CENTRAL
## 9        3871         01        RES    1 STORY AIR-DUCTED   ELECTRIC AC-CENTRAL
## 10       3400         01        RES  2.0 STORY AIR-DUCTED   ELECTRIC AC-CENTRAL
##                extwall  foundation numfirepla fireplaces bldggrade fullbaths
## 1         STUCCO HRDCT CRAWL SPACE          1        FP4 EXCELLENT         6
## 2           FACE BRICK CRAWL SPACE          1        FP2      GOOD         2
## 3  HARDIPLK/DSGN VINYL CRAWL SPACE          1        FP2 EXCELLENT         4
## 4           FACE BRICK CRAWL SPACE          0       <NA>   AVERAGE         1
## 5           ALUM,VINYL CRAWL SPACE          1        FP2      GOOD         2
## 6           FACE BRICK CRAWL SPACE          1        FP4      GOOD         2
## 7         WOOD ON SHTG CRAWL SPACE          1        FP3   AVERAGE         1
## 8         WOOD ON SHTG CRAWL SPACE          1        FP3      GOOD         2
## 9           FACE BRICK CRAWL SPACE          1        FP3   AVERAGE         3
## 10        WOOD SHINGLE CRAWL SPACE          1        FP3      GOOD         3
##    halfbaths bedrooms units landusecod physicalde propertyus    descproper
## 1          1        7     1       R122         AV         01 Single-Family
## 2          1        3     1       R122         AV         01 Single-Family
## 3          1        5     1       R122         AV         01 Single-Family
## 4          1        3     1       R100         AV         01 Single-Family
## 5          1        3     1       R100         AV         01 Single-Family
## 6          1        4     1       R100         AV         01 Single-Family
## 7          0        2     1       R120         AV         01 Single-Family
## 8          1        3     1       R100         AV         01 Single-Family
## 9          0        3     1       R122         AV         01 Single-Family
## 10         0        4     1       R122         AV         01 Single-Family
##    shape_Leng shape_Area musaID toPredict age      sqft   pricesqft totbaths
## 1    630.5433   19277.33      7 MODELLING  17  21780.00 0.016562738      6.5
## 2    633.6883   23621.33      8 MODELLING  25  21780.00 0.023545946      2.5
## 3    618.1447   18101.26      9 MODELLING   4  18099.18 0.007173674      4.5
## 4    565.5071   18096.04     10 MODELLING  42      0.00 0.000000000      1.5
## 5    686.9222   19762.49     14 MODELLING  19      0.00 0.000000000      2.5
## 6    575.6071   20075.25     16 MODELLING  30      0.00 0.000000000      2.5
## 7   1475.0216  102353.15     21 MODELLING 112 106286.40 0.390758824      1.0
## 8    650.2860   25235.96     22 MODELLING  34      0.00 0.000000000      2.5
## 9    942.4561   55826.73     26 MODELLING  46  55843.92 0.068143893      3.0
## 10   702.0724   25346.26     27 MODELLING  57  25918.20 0.035119512      3.0
##    quality school_nn1 grocery_nn1 park_nn1       GEOID HH_inc college_perc FID
## 1        6   6976.639    3974.910 2579.115 37119006220  92038     0.293934  24
## 2        4   7157.136    3970.221 2737.943 37119006220  92038     0.293934  24
## 3        6   7101.836    3929.905 2715.030 37119006220  92038     0.293934  24
## 4        3   6951.205    4149.446 2444.677 37119006220  92038     0.293934  24
## 5        4   6326.410    3685.299 2338.371 37119006220  92038     0.293934  24
## 6        4   6150.081    3232.795 2156.501 37119006220  92038     0.293934  24
## 7        3   5882.428    2654.306 1504.839 37119006220  92038     0.293934  24
## 8        4   6186.072    2484.837 1378.492 37119006220  92038     0.293934  24
## 9        3   5861.251    2384.744 1175.440 37119006220  92038     0.293934  24
## 10       4   5941.004    2294.824 1129.048 37119006220  92038     0.293934  24
##    MIDD_NAME Shape_Leng Shape_Area                  geometry NUM_FLOORS.cat
## 1    Bradley    2486536 1051124394   POINT (433648.4 188499) Up to 2 Floors
## 2    Bradley    2486536 1051124394   POINT (433609.9 188678) Up to 2 Floors
## 3    Bradley    2486536 1051124394 POINT (433660.1 188638.3) Up to 2 Floors
## 4    Bradley    2486536 1051124394 POINT (433492.7 188406.7) Up to 2 Floors
## 5    Bradley    2486536 1051124394   POINT (434138.4 187989) Up to 2 Floors
## 6    Bradley    2486536 1051124394   POINT (434656.6 187965) Up to 2 Floors
## 7    Bradley    2486536 1051124394 POINT (435429.3 187849.1) Up to 2 Floors
## 8    Bradley    2486536 1051124394 POINT (435415.7 188153.8) Up to 2 Floors
## 9    Bradley    2486536 1051124394 POINT (435774.5 187865.8) Up to 2 Floors
## 10   Bradley    2486536 1051124394 POINT (435819.2 187949.2) Up to 2 Floors
##        NUM_BEDS.cat       NUM_BATHS.cat price.Predict price.Error
## 1       5+ Bedrooms        5+ Bathrooms     1815081.0   500081.00
## 2  Up to 4 Bedrooms Up to 2.5 Bathrooms      575184.5  -349815.51
## 3       5+ Bedrooms   Up to 4 Bathrooms     1208467.6 -1314532.39
## 4  Up to 4 Bedrooms Up to 2.5 Bathrooms      209874.8   -40125.17
## 5  Up to 4 Bedrooms Up to 2.5 Bathrooms      566625.6  -108374.42
## 6  Up to 4 Bedrooms Up to 2.5 Bathrooms      576075.2   154075.23
## 7  Up to 2 Bedrooms Up to 2.5 Bathrooms      430946.5   158946.46
## 8  Up to 4 Bedrooms Up to 2.5 Bathrooms      652476.2    64976.16
## 9  Up to 4 Bedrooms   Up to 4 Bathrooms      613591.7  -205908.28
## 10 Up to 4 Bedrooms   Up to 4 Bathrooms      774286.2    36286.22
##    price.AbsError  price.APE lagPriceError
## 1       500081.00 0.27551443    -384842.62
## 2       349815.51 0.60817967    -214863.32
## 3      1314532.39 1.08776800     -21919.94
## 4        40125.17 0.19118617    -270213.69
## 5       108374.42 0.19126285            NA
## 6       154075.23 0.26745679            NA
## 7       158946.46 0.36883111    -127849.90
## 8        64976.16 0.09958396    -201622.99
## 9       205908.28 0.33557864     -54878.95
## 10       36286.22 0.04686410            NA
moranTest <- moran.mc(charlotte.test$price.Error, 
                      spatialWeights.test, nsim = 999, na.action = na.exclude, zero.policy = TRUE)

# Observed and permuted Moran's I 
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

Accounting for Neighborhood

We introduce neighborhoods into the model in an attempt to account for spatial bias in our predictions. Specifically, we have included “Census Block Groups” because that information was readily available. The addition of this variable increases the R-squared of our model, meaning it can explain more of the observed variance with neighborhoods included than without.
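The mechanics can be sketched on simulated data: a categorical neighborhood variable enters lm() as a set of dummy (fixed-effect) terms, one per neighborhood, which absorb location-specific price levels. All names and numbers below are hypothetical:

```r
# Hypothetical demonstration that a neighborhood fixed effect raises R-squared
set.seed(1)
nbhd  <- rep(c("A", "B"), each = 50)
level <- ifelse(nbhd == "A", 100000, 300000)  # neighborhood price premium
sqft  <- runif(100, 1000, 3000)
price <- level + 150 * sqft + rnorm(100, sd = 30000)

r2_base <- summary(lm(price ~ sqft))$r.squared
r2_nbhd <- summary(lm(price ~ sqft + factor(nbhd)))$r.squared
r2_nbhd > r2_base  # TRUE: the fixed effect explains additional variance
```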

#Adjusting for neighborhood
reg.nhood <- lm(price ~ ., data = as.data.frame(charlotte.training) %>% 
                                 dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc,
                                               MIDD_NAME)) # include the neighborhood variable
summary(reg.nhood)
## 
## Call:
## lm(formula = price ~ ., data = as.data.frame(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1700590   -84337    -8355    62079 43777952 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -294136.1494   39669.6022  -7.415
## heatedarea                           136.5569       4.4684  30.561
## quality                           197602.6787    5045.2573  39.166
## NUM_FLOORS.catUp to 2 Floors       58969.3580   21485.5243   2.745
## NUM_FLOORS.catUp to 3 Floors       35852.5896   27889.6130   1.286
## NUM_BEDS.catUp to 2 Bedrooms       54004.0648   15516.9983   3.480
## NUM_BEDS.catUp to 4 Bedrooms       39899.3091   10175.7475   3.921
## NUM_BATHS.catUp to 2.5 Bathrooms -431122.6701   24530.2029 -17.575
## NUM_BATHS.catUp to 4 Bathrooms   -439686.2605   22236.6845 -19.773
## park_nn1                              -4.2150       3.3004  -1.277
## grocery_nn1                          -21.9551       2.2168  -9.904
## age                                 2294.5786     121.2829  18.919
## HH_inc                                 0.7781       0.1164   6.687
## college_perc                       24732.5021   32799.3916   0.754
##                                              Pr(>|t|)    
## (Intercept)                         0.000000000000126 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                 0.006063 ** 
## NUM_FLOORS.catUp to 3 Floors                 0.198623    
## NUM_BEDS.catUp to 2 Bedrooms                 0.000502 ***
## NUM_BEDS.catUp to 4 Bedrooms        0.000088405034962 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                     0.201564    
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                              0.000000000023232 ***
## college_perc                                 0.450824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 395800 on 25554 degrees of freedom
##   (2146 observations deleted due to missingness)
## Multiple R-squared:  0.3201, Adjusted R-squared:  0.3197 
## F-statistic: 925.4 on 13 and 25554 DF,  p-value: < 0.00000000000000022
#Compute test-set predictions and error metrics for the neighborhood-effects model
charlotte.test.nhood <-
  charlotte.test %>%
  mutate(Regression = "Neighborhood Effects",
         price.Predict = predict(reg.nhood, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = abs(price.Predict - price) / price) %>%
  filter(price < 5000000)
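With those error columns in place, the headline accuracy numbers fall out of a single summarize. A quick sketch, assuming the `charlotte.test.nhood` object created above:

```r
#Summarize accuracy for the neighborhood-effects model (sketch;
#assumes the charlotte.test.nhood object created above)
charlotte.test.nhood %>%
  st_drop_geometry() %>%
  summarize(MAE  = mean(price.AbsError, na.rm = TRUE),
            MAPE = mean(price.APE, na.rm = TRUE))
```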

#Label the baseline test set for comparison
charlotte.test <- charlotte.test %>%
  mutate(Regression = "Baseline")

#Predict prices for the challenge set and export them
sales_predictions.sf <- CLT_internal %>%
  mutate(price.Predict = predict(reg.nhood, CLT_internal)) %>%
  filter(toPredict == "CHALLENGE")

#Keep only the two columns needed for submission (selected by position)
sales_predictions.df <- as.data.frame(st_drop_geometry(sales_predictions.sf))
sales_predictions.df <- sales_predictions.df[c(30, 43)]

write.csv(sales_predictions.df, "Quisqueyanes.csv", row.names = FALSE)
#Combine both models' results, adding the spatial lag of each model's price errors
bothRegressions <- 
  rbind(
    dplyr::select(charlotte.test, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE)),
    dplyr::select(charlotte.test.nhood, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE)))
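The `lagPriceError` column captures, for each sale, the average prediction error of its neighbors. A permutation-based Moran's I test makes the same question formal: do errors cluster in space? This is a sketch using `spdep`, and it assumes the `spatialWeights.test` object was built on the same rows as `charlotte.test`:

```r
#Moran's I permutation test on the baseline model's errors (sketch;
#assumes spatialWeights.test matches the rows of charlotte.test)
library(spdep)
moran.mc(charlotte.test$price.Error, spatialWeights.test,
         nsim = 999, na.action = na.omit, zero.policy = TRUE)
```

A small p-value here would indicate that errors are spatially clustered, i.e., the model still leaves neighborhood-level signal on the table.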

Visualizing Neighborhood Effects

While controlling for neighborhood improves our model, the gain is modest and difficult to discern visually when we plot predicted against observed prices side by side.

#Neighborhood effect results
bothRegressions %>%
  dplyr::select(price.Predict, price, Regression) %>%
    ggplot(aes(price, price.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(price.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; green line represents the model's prediction") +
  plotTheme()

#Scatter plot of MAPE by neighborhood mean price
MAPE_by_NPA <- charlotte.test %>%
  st_drop_geometry() %>%
  drop_na(price.APE) %>%
  group_by(MIDD_NAME) %>%
  summarise(mean_APE = mean(price.APE),
            mean_Price = mean(price.Predict))

ggplot(MAPE_by_NPA, aes(mean_Price, mean_APE))+
  geom_jitter(height=2, width=2)+
  ylim(-5,5)+
  geom_smooth(method = "lm", aes(mean_Price, mean_APE), se = FALSE, colour = "red") +
  labs(title = "MAPE by Neighborhood Mean Sales Price",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme()
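To put a number on the trend shown in the scatter plot, the correlation between neighborhood mean price and MAPE can be checked directly. A one-liner sketch, assuming the `MAPE_by_NPA` table built above:

```r
#Correlation between neighborhood mean predicted price and mean APE
#(sketch; assumes MAPE_by_NPA from the chunk above)
cor(MAPE_by_NPA$mean_Price, MAPE_by_NPA$mean_APE, use = "complete.obs")
```

A correlation near zero would suggest the model's percentage errors are roughly constant across price levels, a desirable property for generalizability.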

st_drop_geometry(charlotte.test) %>%
  group_by(MIDD_NAME) %>%
  summarize(MAPE = mean(price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(CLT_neighborhoods) %>%
    st_sf() %>%
    ggplot() + 
   geom_sf(aes(fill = MAPE), color=NA) +
      scale_fill_gradient(low = palette5[5], high = palette5[1],
                          name = "MAPE") +
  geom_sf(data = charlotte.test, colour = "black", show.legend = "point", size = .05) +
      labs(title = "Mean test set MAPE by Neighborhood", subtitle="Charlotte Metro Area") +
      mapTheme()

Conclusion

Results

Our variables are able to explain ~50% of the variation observed in Charlotte home prices (R-squared). Our mean absolute error (MAE) is 106,593.9, indicating that, on average, our model’s price predictions differ from actual home prices by ~$106,593. This is fair and may be driven by outliers (costly homes) included in our dataset. Ultimately, the model is generalizable, but requires some tweaks before our Zillow pitch meeting :)

Discussion

Overall, our model is sufficient and can be strengthened with a few modifications. For example, for this project we were not allowed to use nearby sale prices in our algorithm. This rule added a level of difficulty because recent sales are a strong predictor of home value (e.g., the sale price of your neighbor’s home is likely very similar to the sale price of your own). We compensated by emphasizing other variables such as the number of floors, bedrooms, and bathrooms, housing quality, the quality of nearby schools, and a home’s proximity to grocery stores.
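If the restriction on sales data were lifted, one straightforward improvement would be a spatial-lag feature: the average sale price of each home's nearest sold neighbors. The sketch below is hypothetical, not part of our submitted model; the choice of k = 5 and the `lagPrice` name are our own, and it assumes `CLT_internal` is the sf point layer with a `price` column.

```r
#Hypothetical feature: average sale price of the five nearest sales
#(sketch; assumes CLT_internal is an sf point layer with a price column)
library(sf)
library(spdep)
library(dplyr)

coords     <- st_coordinates(st_centroid(CLT_internal))
nb5        <- knn2nb(knearneigh(coords, k = 5))
lagWeights <- nb2listw(nb5, style = "W")

CLT_internal <- CLT_internal %>%
  mutate(lagPrice = lag.listw(lagWeights, price, NAOK = TRUE))
```

Adding `lagPrice` to the regression formula would let the model borrow price signal from comparable nearby sales, the same intuition appraisers use when selecting "comps."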